2025-05-16-12-04
A Multimodal Multi-Agent Framework for Radiology Report Generation
Abstract
arXiv:2505.09787v1 Announce Type: new Abstract: Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images, with the potential to enhance clinical workflows and reduce radiologists' workload. While recent approaches leveraging multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have achieved strong results, they continue to face challenges such as factual inconsistency, hallucination, and cross-modal misalignment. We propose a multimodal multi-agent framework for RRG that aligns with the stepwise clinical reasoning workflow, where task-specific agents handle retrieval, draft generation, visual analysis, refinement, and synthesis. Experimental results demonstrate that our approach outperforms a strong baseline in both automatic metrics and LLM-based evaluations, producing more accurate, structured, and interpretable reports. This work highlights the potential of clinically aligned multi-agent frameworks to support explainable and trustworthy clinical AI applications.
摘要
放射学报告生成(RRG)旨在通过医学影像自动生成诊断报告,具有优化临床工作流程和减轻放射科医生工作负荷的潜力。尽管当前基于多模态大语言模型(MLLMs)和检索增强生成(RAG)的方法已取得显著成果,但仍面临事实不一致、幻觉生成及跨模态失准等挑战。本研究提出一种符合临床分步推理流程的多模态多智能体框架,通过任务专用智能体分别处理检索、草稿生成、视觉分析、精炼与合成等环节。实验结果表明,该方法在自动指标和大语言模型评估中均优于强基线模型,能生成更准确、结构化且可解释的报告。本工作揭示了临床导向的多智能体框架在支持可解释、可信赖临床人工智能应用方面的潜力。
Demystifying AI Agents: The Final Generation of Intelligence
Abstract
arXiv:2505.09932v1 Announce Type: new Abstract: The trajectory of artificial intelligence (AI) has been one of relentless acceleration, evolving from rudimentary rule-based systems to sophisticated, autonomous agents capable of complex reasoning and interaction. This whitepaper chronicles this remarkable journey, charting the key technological milestones--advancements in prompting, training methodologies, hardware capabilities, and architectural innovations--that have converged to create the AI agents of today. We argue that these agents, exemplified by systems like OpenAI's ChatGPT with plugins and xAI's Grok, represent a culminating phase in AI development, potentially constituting the "final generation" of intelligence as we currently conceive it. We explore the capabilities and underlying technologies of these agents, grounded in practical examples, while also examining the profound societal implications and the unprecedented pace of progress that suggests intelligence is now doubling approximately every six months. The paper concludes by underscoring the critical need for wisdom and foresight in navigating the opportunities and challenges presented by this powerful new era of intelligence.
摘要
人工智能(AI)的发展轨迹始终呈现加速态势,已从基于简单规则的系统演变为具备复杂推理与交互能力的自主智能体。本白皮书系统梳理了这一演进历程,重点分析了促成当代AI智能体的关键技术里程碑——包括提示工程、训练方法、硬件能力及架构创新等领域的突破性进展。我们认为,以OpenAI插件版ChatGPT和xAI的Grok为代表的智能体,标志着AI发展可能已进入终极阶段,或将成为当前认知框架下的"最终代际"智能形态。通过具体案例,我们深入探讨了这些智能体的核心能力与技术基础,同时剖析了其带来的深远社会影响。研究指出,智能水平正以约每六个月翻倍的速度跃进,这种前所未有的发展速度要求我们以高度的智慧与远见来应对这一强大智能新时代所带来的机遇与挑战。
Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era
Abstract
arXiv:2505.09651v1 Announce Type: new Abstract: Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective, and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. An up-to-date paper list is available at https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will be updated continuously.
摘要
位置智能(LI)作为将基于位置的地理空间数据转化为可操作知识的科学,已成为现代空间决策的基石。地理空间表征学习的快速发展正通过两次连续的技术革命从根本上重塑LI的发展:深度学习突破与新兴的大语言模型(LLM)范式。尽管深度神经网络(DNN)在从结构化地理空间数据(如卫星影像、GPS轨迹)中自动提取特征方面表现出显著成效,但LLM的近期整合为跨模态地理空间推理和非结构化地理文本数据处理带来了变革性能力。本文综述全面审视了两个技术时代下的地理空间表征学习,基于包含以下环节的完整流程构建了结构化分类体系:(1) 数据视角,(2) 方法视角,(3) 应用视角。我们同时强调了当前进展,讨论了现存局限,并提出了LLM时代潜在的未来研究方向。这项工作提供了对该领域的深入探索,并为LI的进一步创新绘制了路线图。
AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron
Abstract
arXiv:2505.09989v1 Announce Type: new Abstract: AI power demand is growing at an unprecedented rate, driven by the high power density of AI compute and the emerging inferencing workload. On the supply side, abundant wind power is waiting for grid access in interconnection queues. In this light, this paper argues for bringing AI workloads to modular compute clusters co-located in wind farms. Our deployment right-sizing strategy makes it economically viable to deploy more than 6 million high-end GPUs today that could consume cheap, green power at its source. We built Heron, a cross-site software router that efficiently leverages the complementarity of power generation across wind farms by routing AI inferencing workload around power drops. Using one week of coding and conversation production traces from Azure and (real) variable wind power traces, we show that Heron improves the aggregate goodput of AI compute by up to 80% compared to the state of the art.
摘要
由于AI计算的高功率密度和新兴推理工作负载,其电力需求正经历前所未有的增长。在供应端,大量风电资源正等待通过互联队列接入电网。基于此,本文提出将AI工作负载部署于风电场的模块化计算集群中。我们的部署规模优化策略使得当前部署超过600万块高端GPU具有经济可行性,这些GPU可直接利用廉价、绿色的源头电力。我们开发了Heron——一个跨站点软件路由器,它能够通过根据电力波动动态调度AI推理任务,高效利用不同风电场间的发电互补性。基于Azure为期一周的编码与对话生产轨迹及(真实)可变风电数据,我们证明相较于现有最优方案,Heron能将AI计算的聚合有效吞吐量提升最高达80%。
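The idea of routing inference around power drops can be illustrated with a toy sketch. It is not Heron's algorithm, which the abstract does not describe; it is a greedy stand-in that assumes each site's serving capacity scales linearly with the power currently available. The site names, the `requests_per_mw` factor, and the `Site` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str          # hypothetical wind-farm site label
    power_mw: float    # power currently available at the site
    load: int = 0      # requests already routed to this site


def capacity(site: Site, requests_per_mw: float) -> int:
    """Requests a site can serve, assuming capacity is linear in power."""
    return int(site.power_mw * requests_per_mw)


def route(sites: list[Site], n_requests: int, requests_per_mw: float = 10.0) -> int:
    """Greedily send each request to the site with the most spare capacity.

    Returns the number of requests that could not be placed anywhere.
    """
    dropped = 0
    for _ in range(n_requests):
        best = max(sites, key=lambda s: capacity(s, requests_per_mw) - s.load)
        if capacity(best, requests_per_mw) - best.load <= 0:
            dropped += 1   # every site is saturated
        else:
            best.load += 1
    return dropped


# Normal conditions: both sites have power.
sites = [Site("farm_a", power_mw=5.0), Site("farm_b", power_mw=1.0)]
dropped = route(sites, n_requests=55)

# Power drop at farm_a: all traffic shifts to farm_b until it saturates.
sites_after_drop = [Site("farm_a", power_mw=0.0), Site("farm_b", power_mw=1.0)]
dropped_after_drop = route(sites_after_drop, n_requests=15)
```

A production router would also have to account for latency, request type, and forecasted power, none of which this sketch models.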
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents
Abstract
arXiv:2505.09970v1 Announce Type: new Abstract: The ReAct (Reasoning + Action) capability in large language models (LLMs) has become the foundation of modern agentic systems. Recent LLMs, such as DeepSeek-R1 and OpenAI o1/o3, exemplify this by emphasizing reasoning through the generation of ample intermediate tokens, which help build a strong premise before producing the final output tokens. In this paper, we introduce Pre-Act, a novel approach that enhances the agent's performance by creating a multi-step execution plan along with the detailed reasoning for the given user input. This plan incrementally incorporates previous steps and tool outputs, refining itself after each step execution until the final response is obtained. Our approach is applicable to both conversational and non-conversational agents. To measure the performance of task-oriented agents comprehensively, we propose a two-level evaluation framework: (1) turn level and (2) end-to-end. Our turn-level evaluation, averaged across five models, shows that our approach, Pre-Act, outperforms ReAct by 70% in Action Recall on the Almita dataset. While this approach is effective for larger models, smaller models crucial for practical applications, where latency and cost are key constraints, often struggle with complex reasoning tasks required for agentic systems. To address this limitation, we fine-tune relatively small models such as Llama 3.1 (8B & 70B) using the proposed Pre-Act approach. Our experiments show that the fine-tuned 70B model outperforms GPT-4, achieving a 69.5% improvement in action accuracy (turn-level) and a 28% improvement in goal completion rate (end-to-end) on the Almita (out-of-domain) dataset.
摘要
大型语言模型(LLMs)中的ReAct(推理+行动)能力已成为现代代理系统的基础。近期诸如DeepSeek-R1和OpenAI o1/o3等模型通过生成大量中间推理标记强化了这一特性,这些标记在输出最终结果前构建了坚实的前提基础。本文提出Pre-Act方法,该创新方案通过为给定用户输入创建包含详细推理的多步骤执行计划来提升代理性能。该计划逐步整合先前步骤及工具输出,并在每一步执行后自我优化直至获得最终响应。我们的方法同时适用于对话型与非对话型代理。为全面评估任务导向型代理性能,我们提出两级评估框架:(1)轮次层面;(2)端到端层面。在Almita数据集上的实验表明,五个模型的平均轮次级评估中,Pre-Act方法在行动召回率上较ReAct提升70%。虽然该方法对大模型效果显著,但在实际应用中受延迟和成本限制的关键小模型往往难以胜任代理系统所需的复杂推理任务。为此,我们采用Pre-Act方法对Llama 3.1(8B & 70B)等较小模型进行微调。实验显示,微调后的70B模型在Almita(跨领域)数据集上表现优于GPT-4,其行动准确率(轮次级)提升69.5%,目标完成率(端到端)提高28%。
ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production
Abstract
arXiv:2505.09999v1 Announce Type: new Abstract: With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. We will open-source ServeGen to foster future research.
摘要
随着大语言模型(LLMs)的广泛采用,处理LLM推理请求已成为日益重要的任务,并推动了相关研究的活跃进展。实际工作负载在此过程中起着关键作用:它们对激励和评估服务技术与系统至关重要。然而,由于缺乏全面的工作负载特征分析,目前对现实世界LLM服务负载的理解仍存在局限。先前研究在规模和范围上均显不足,因而未能充分捕捉复杂的负载特性。
本文通过深度分析从全球云推理服务平台收集的LLM服务负载,填补了这一空白。研究不仅涵盖语言模型,还包括新兴的多模态与推理模型,并在每种情况下揭示了重要的新发现。基于这些发现,我们提出了ServeGen——一种通过按客户端组合生成真实LLM服务负载的原则性框架。实际生产中的用例验证表明,与简单负载生成方法相比,ServeGen可避免50%的资源供给不足,证明了其在性能基准测试中的优势。我们将开源ServeGen以促进未来研究。
From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI
Abstract
arXiv:2505.10093v1 Announce Type: new Abstract: Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by Taiwan's unique geopolitical position and long-standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan-based CS scholarship by proposing an AI-assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight D3.js-based system, forming the foundation of a domain-specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, our system enables a paradigm shift from linear text consumption to network-based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data-driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.
摘要
台湾的中国研究(CS)已发展成为一个丰富多元的跨学科研究领域,其形成受到台湾独特的地缘政治地位及与大陆长期学术交流的影响。为应对系统性重审与整合数十年来台湾CS学术成果的迫切需求,本研究提出一种人工智能辅助方法,将非结构化学术文本转化为结构化、可交互的知识表征。我们运用生成式人工智能(GAI)技术和大语言模型(LLMs),从1996至2019年间发表的1,367篇CS同行评议论文中提取并标准化实体关系三元组,随后通过基于D3.js的轻量级系统进行可视化,构建该领域专用知识图谱与向量数据库的基础架构。该基础设施使用户能探索语料库中的概念节点与语义关系,揭示未被发现的知识轨迹、主题集群与研究空白。通过将文本内容解构为图结构知识单元,本系统实现了从线性文本消费到基于网络的知识导航的范式转变,既提升了学者对CS文献的获取效率,也为传统本体构建提供了可扩展的数据驱动替代方案。本研究不仅展示了生成式AI如何增强区域研究与数字人文,更凸显了其支持区域性知识系统重塑学术基础设施的潜力。
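To make the step from entity-relation triples to a navigable graph concrete, here is a minimal sketch. The triples are placeholders invented for illustration, not data from the corpus, and the node-link layout (`nodes` with `id`, `links` with `source` and `target`) is an assumption about what a D3.js-style front end might consume rather than the authors' actual schema.

```python
from collections import defaultdict

# Placeholder triples for illustration only; they are not taken from the corpus.
triples = [
    ("Concept A", "influences", "Concept B"),
    ("Concept B", "is studied by", "Author X"),
    ("Concept A", "is studied by", "Author X"),
]


def build_graph(triples):
    """Index (head, relation, tail) triples as an adjacency map keyed by head."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def neighbors(graph, entity):
    """Return the entities reachable from `entity` in one hop, sorted."""
    return sorted({tail for _, tail in graph.get(entity, [])})


def to_node_link(triples):
    """Convert triples to a node-link dict (assumed front-end format)."""
    names = sorted({t[0] for t in triples} | {t[2] for t in triples})
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": head, "target": tail, "relation": relation}
            for head, relation, tail in triples
        ],
    }


graph = build_graph(triples)
node_link = to_node_link(triples)
```

Serializing `node_link` with `json.dumps` would give a payload a browser-side visualization could load; the extraction and standardization of triples by an LLM is not modeled here.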
MASS: Multi-Agent Simulation Scaling for Portfolio Construction
Abstract
arXiv:2505.10278v1 Announce Type: new Abstract: LLM-based multi-agent systems have gained significant attention for their potential in simulation and in enhancing performance. However, existing works are limited to pure simulations or are constrained by predefined workflows, restricting their applicability and effectiveness. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS) for portfolio construction. MASS achieves stable and continuous excess returns by progressively increasing the number of agents for large-scale simulations to gain a superior understanding of the market and by optimizing agent distribution end-to-end through a reverse optimization process, rather than relying on a fixed workflow. We demonstrate its superiority through performance experiments, ablation studies, backtesting experiments, experiments on updated data and stock pools, scaling experiments, parameter sensitivity experiments, and visualization experiments, conducted in comparison with 6 state-of-the-art baselines on 3 challenging A-share stock pools. We expect the paradigm established by MASS to expand to other tasks with similar characteristics. The implementation of MASS has been open-sourced at https://github.com/gta0804/MASS.
摘要
基于大语言模型的多智能体系统因其在模拟和提升性能方面的潜力而受到广泛关注。然而,现有研究仅限于纯模拟或受限于预定义的工作流程,制约了其适用性和有效性。本文提出用于投资组合构建的多智能体规模化模拟(MASS)方法。MASS通过逐步增加智能体数量进行大规模模拟以深入理解市场,并通过逆向优化过程端到端优化智能体分布,而非依赖固定工作流程,从而实现稳定且持续的超额收益。我们在3个具有挑战性的A股股票池上,与6种最先进的基线方法进行了性能实验、消融研究、回测实验、更新数据和股票池实验、规模化实验、参数敏感性实验及可视化实验,验证了其优越性。我们期望MASS建立的范式能够扩展到具有类似特征的其他任务中。MASS的实现已开源在https://github.com/gta0804/MASS。
Leveraging Graph Retrieval-Augmented Generation to Support Learners' Understanding of Knowledge Concepts in MOOCs
Abstract
arXiv:2505.10074v1 Announce Type: new Abstract: Massive Open Online Courses (MOOCs) lack direct interaction between learners and instructors, making it challenging for learners to understand new knowledge concepts. Recently, learners have increasingly used Large Language Models (LLMs) to support them in acquiring new knowledge. However, LLMs are prone to hallucinations, which limits their reliability. Retrieval-Augmented Generation (RAG) addresses this issue by retrieving relevant documents before generating a response. However, the application of RAG across different MOOCs is limited by unstructured learning material. Furthermore, current RAG systems do not actively guide learners toward their learning needs. To address these challenges, we propose a Graph RAG pipeline that leverages Educational Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide learners to understand knowledge concepts in the MOOC platform CourseMapper. Specifically, we implement (1) a PKG-based Question Generation method to recommend personalized questions for learners in context, and (2) an EduKG-based Question Answering method that leverages the relationships between knowledge concepts in the EduKG to answer learner-selected questions. To evaluate both methods, we conducted a study with 3 expert instructors on 3 different MOOCs in the MOOC platform CourseMapper. The results of the evaluation show the potential of Graph RAG to empower learners to understand new knowledge concepts in a personalized learning experience.
摘要
大规模开放在线课程(MOOCs)缺乏学习者与教师之间的直接互动,这使学习者在理解新知识概念时面临挑战。近年来,学习者越来越多地使用大语言模型(LLMs)来辅助获取新知识。然而,LLMs容易产生幻觉,这限制了其可靠性。检索增强生成(RAG)通过在生成响应前检索相关文档来解决这一问题。然而,非结构化的学习材料限制了RAG在不同MOOCs中的应用。此外,当前的RAG系统未能主动引导学习者满足其学习需求。为应对这些挑战,我们提出了一种图RAG流程,利用教育知识图谱(EduKGs)和个人知识图谱(PKGs)引导学习者在MOOC平台CourseMapper中理解知识概念。具体而言,我们实现了(1)基于PKG的问题生成方法,为学习者推荐上下文相关的个性化问题;(2)基于EduKG的问题回答方法,利用EduKG中知识概念之间的关系回答学习者选择的问题。为评估这两种方法,我们在CourseMapper平台上针对3门不同MOOCs课程与3位专家教师开展了研究。评估结果表明,图RAG在赋能学习者通过个性化学习体验理解新知识概念方面具有潜力。
Empirically evaluating commonsense intelligence in large language models with large-scale human judgments
Abstract
arXiv:2505.10309v1 Announce Type: new Abstract: Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
摘要
机器常识智能通常通过静态基准测试进行评估,这些测试将模型的输出与人类预设的正确标签进行对比。这些标签隐含着一个重要假设:它们能准确反映所有人类的共识,实质上将人类常识视为同质化存在。然而最新实证研究表明,人类对常识的认知存在巨大差异——某个基准设计者认为不言而喻的结论,对其他人可能并非如此。为此,我们提出一种评估人工智能(尤其是大语言模型)常识的新方法,该方法通过测量模型判断与人类群体判断的对应关系,将实证观察到的人类异质性纳入考量。研究发现:首先,当被视为独立调查对象时,大多数大语言模型在个体常识能力上仍低于人类中位数水平;其次,当模拟假设人群时,大语言模型与真实人类在陈述认同度上仅呈现适度相关性。值得注意的是,在这两种情况下,较小规模的开源模型表现竟优于更大规模的专有前沿模型。我们的评估框架将常识智能与其文化基础相关联,响应了当前学界日益强烈的呼吁:需要使AI模型适应那些拥有不同(往往互不兼容)社会知识储备的人类群体。
Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
Abstract
arXiv:2505.10543v1 Announce Type: new Abstract: While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments. First, we find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to large performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.
摘要
尽管大型语言模型在静态基准测试中展现出卓越性能,但其作为动态环境中自主学习和推理智能体的真正潜力仍不明确。本研究系统评估了自我反思、启发式变异和规划三种提示技术对智能体适应能力的提升效果。通过在动态环境中对多种开源语言模型进行实验,我们发现:首先,大模型通常优于小模型,但策略性提示能缩小这一性能差距;其次,过长的提示会损害小模型在基础反应任务中的表现,而大模型则展现出更强的鲁棒性;第三,高级提示技术主要提升小模型在复杂游戏中的表现,但对本已高性能的大模型改进有限。然而,我们发现高级推理方法会产生高度不稳定的结果——当推理与决策一致时可显著提升性能,但也可能引发不稳定并导致性能大幅下降。与人类表现相比,研究结果几乎没有发现真正涌现式推理的证据。当前大型语言模型在规划、推理和空间协调等关键领域仍存在持续局限,表明仅靠自我反思式提示可能无法完全克服这一代模型的根本缺陷。推理是多维度的任务,虽然'思维链'等方法能提升数学应用题的多步推理能力,但我们在动态基准测试中发现通用推理能力存在重要缺陷,这说明需要超越静态基准测试才能真正把握推理的复杂性。
An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs
Abstract
arXiv:2505.09724v1 Announce Type: cross Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.
摘要
分析开放式回答、新闻标题或社交媒体帖子等文本是一个耗时费力且极易产生偏差的过程。大型语言模型(LLMs)是文本分析的有力工具,既可采用预定义(自上而下)也可采用数据驱动(自下而上)的分类体系,同时不牺牲分析质量。本文通过研究者与LLMs之间的迭代协作流程,逐步演示如何高效开发、测试并应用分类体系来分析非结构化数据。以参与者提供的个人目标为例,我们展示了如何编写提示词来审阅数据集并生成生活领域分类体系,通过提示词调整和直接修改来评估优化该体系,测试分类体系并评估编码者间一致性,最终将该体系应用于整个数据集的分类工作且保持较高的编码者间信度。文中还探讨了使用LLMs进行文本分析的可能性与局限性。
System Prompt Optimization with Meta-Learning
Abstract
arXiv:2505.09666v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
摘要
大型语言模型(LLMs)已展现出卓越的能力,其中优化输入提示对最大化其性能起着关键作用。然而,尽管LLM提示包含与任务无关的系统提示和特定于任务的用户提示,现有关于提示优化的研究主要关注针对单个查询或任务的用户提示,而很大程度上忽略了系统提示——这种提示一旦优化,便可跨不同任务和领域适用。基于此,我们提出了双层系统提示优化这一新问题,其目标是设计对多样化用户提示具有鲁棒性且可迁移至未见任务的系统提示。为解决该问题,我们提出一个元学习框架,通过在多个数据集上针对不同用户提示优化系统提示进行元学习,同时以迭代方式更新用户提示以确保二者协同。我们在涵盖5个不同领域的14个未见数据集上进行实验,结果表明该方法生成的系统提示能有效泛化至多样化的用户提示。此外,研究发现优化后的系统提示即使对未见任务也能实现快速适应,测试时的用户提示只需更少优化步骤即可获得性能提升。
Exploring the generalization of LLM truth directions on conversational formats
Abstract
arXiv:2505.09807v1 Announce Type: cross Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.
摘要
近期多项研究指出,大语言模型(LLM)存在一个通用真实性方向,即在模型的激活空间中,真实陈述与虚假陈述呈线性可分状态。研究表明,仅针对模型单个隐藏状态训练的线性探针,就能在多个主题上实现泛化,甚至可能用于检测LLM对话中的谎言。本研究探讨了这种真实性方向在不同对话形式间的泛化能力。实验发现,模型在以谎言结尾的简短对话间泛化效果良好,但对谎言出现在输入提示较早位置的长对话格式泛化能力较差。我们提出了一种解决方案:通过在每段对话末尾添加固定关键词组,显著改善了此类泛化问题。研究结果凸显了开发能适应新场景的可靠LLM谎言检测器所面临的挑战。
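A linear truth direction can be sketched on toy data. The example below uses a difference-of-means direction as a simple stand-in for a trained linear probe; the abstract does not say which probe the authors fit, so this choice is an assumption. The 2-D "activations" are made up, and `KEY_PHRASE` is a placeholder, since the abstract does not give the actual phrase.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_direction(true_acts, false_acts):
    """Difference-of-means direction, with a bias placing the boundary at the midpoint."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -dot(direction, midpoint)
    return direction, bias


def predict_true(direction, bias, activation):
    """Classify an activation as 'true' if it falls on the positive side."""
    return dot(direction, activation) + bias > 0


KEY_PHRASE = "<fixed key phrase>"  # placeholder; the real phrase is not given


def with_key_phrase(conversation_turns):
    """Append the fixed phrase so the probe always reads the same final position."""
    return conversation_turns + [KEY_PHRASE]


# Made-up 2-D activations standing in for one hidden state per statement.
true_acts = [[1.0, 0.2], [0.8, -0.1]]
false_acts = [[-1.0, 0.1], [-0.9, 0.0]]
direction, bias = fit_direction(true_acts, false_acts)
```

In a real setting the activations would be read from a chosen layer at the last token of each conversation, which is why appending the same closing phrase to every conversation plausibly makes the probed position more comparable across formats.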
Trustless Autonomy: Understanding Motivations, Benefits and Governance Dilemma in Self-Sovereign Decentralized AI Agents
Abstract
arXiv:2505.09757v1 Announce Type: cross Abstract: The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and social media accounts. DeAgents eliminate centralized control and reduce human intervention, addressing key trust concerns inherent in centralized AI systems. However, given ongoing challenges in LLM reliability such as hallucinations, this creates a paradoxical tension between trustlessness and unreliable autonomy. This study addresses this empirical research gap through interviews with DeAgents stakeholders (experts, founders, and developers) to examine their motivations, benefits, and governance dilemmas. The findings will guide future DeAgents system and protocol design and inform discussions about governance in sociotechnical AI systems in the future agentic web.
摘要
近期兴起的自治理去中心化人工智能代理(DeAgents)趋势,将基于大语言模型(LLM)的AI代理与区块链智能合约、可信执行环境(TEE)等去中心化技术相结合。这些抗篡改的无信任基础设施使代理能够通过掌控加密钱包私钥、数字资产及社交媒体账户实现自主治理。DeAgents消除了中心化控制并减少人为干预,解决了中心化AI系统固有的关键信任问题。然而鉴于大语言模型在可靠性(如幻觉问题)方面持续存在的挑战,这导致无信任机制与不可靠自主性之间形成悖论性张力。本研究通过访谈DeAgents利益相关方(专家、创始人与开发者),实证考察其动机、优势与治理困境,以填补该领域研究空白。研究结果将为未来DeAgents系统与协议设计提供指导,并推动关于未来代理网络社会技术AI系统中治理议题的讨论。
Evaluating Large Language Models for the Generation of Unit Tests with Equivalence Partitions and Boundary Values
Abstract
arXiv:2505.09830v1 Announce Type: cross Abstract: The design and implementation of unit tests is a complex task many programmers neglect. This research evaluates the potential of Large Language Models (LLMs) in automatically generating test cases, comparing them with manual tests. An optimized prompt was developed that integrates code and requirements, covering critical cases such as equivalence partitions and boundary values. The strengths and weaknesses of LLMs versus trained programmers were compared through quantitative metrics and manual qualitative analysis. The results show that the effectiveness of LLMs depends on well-designed prompts, robust implementation, and precise requirements. Although flexible and promising, LLMs still require human supervision. This work highlights the importance of manual qualitative analysis as an essential complement to automation in unit test evaluation.
摘要
单元测试的设计与实现是许多程序员忽视的复杂任务。本研究评估了大型语言模型(LLMs)在自动生成测试用例方面的潜力,并将其与人工测试进行对比。通过开发一种集成代码与需求的优化提示模板,覆盖了等价类划分和边界值等关键测试场景。采用定量指标与人工定性分析相结合的方法,比较了LLMs与训练有素的程序员的优劣势。结果表明,LLMs的有效性取决于精心设计的提示模板、健壮的实现以及精确的需求描述。尽管LLMs具有灵活性和应用前景,但仍需人工监督。本研究强调了人工定性分析作为单元测试评估中自动化手段重要补充的必要性。
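For readers unfamiliar with the two test-design techniques named above, the following sketch shows them on a small example. The `ticket_price` function and its age bands are hypothetical and exist only to illustrate the technique; they do not come from the paper.

```python
def ticket_price(age: int) -> int:
    """Hypothetical pricing rule used only to illustrate the technique."""
    if age < 0 or age > 120:
        raise ValueError("age out of range")
    if age <= 12:
        return 5     # child partition: 0..12
    if age <= 64:
        return 10    # adult partition: 13..64
    return 7         # senior partition: 65..120


# Equivalence partitions: one representative value per valid class.
PARTITION_CASES = [(6, 5), (30, 10), (80, 7)]

# Boundary values: the first and last value of every valid class.
BOUNDARY_CASES = [(0, 5), (12, 5), (13, 10), (64, 10), (65, 7), (120, 7)]

# Invalid partitions, probed just outside the valid range.
INVALID_CASES = [-1, 121]


def run_cases() -> int:
    """Run every case and return how many inputs were checked."""
    for age, expected in PARTITION_CASES + BOUNDARY_CASES:
        assert ticket_price(age) == expected, (age, expected)
    for age in INVALID_CASES:
        try:
            ticket_price(age)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for age={age}")
    return len(PARTITION_CASES) + len(BOUNDARY_CASES) + len(INVALID_CASES)


checked = run_cases()
```

Off-by-one mistakes such as writing `age < 12` instead of `age <= 12` pass the partition representatives but fail at the boundary values, which is why both kinds of case are worth including in a prompt or a manual test plan.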
Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
Abstract
arXiv:2505.09738v1 Announce Type: cross Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.
摘要
预训练语言模型(LLMs)常受限于其固定的分词方案,这会导致效率低下和性能局限,尤其在多语言或专业应用中表现显著。这种分词器锁定现象带来了重大挑战,而现有标准解决方法通常需要极高的计算资源。尽管通过启发式初始化替换分词器旨在减轻负担,但现有方法往往需要大量残差微调,且可能无法完整保留语义细微差异或有效解决底层压缩效率问题。我们提出包含两项创新的框架:其一为TokenAdapt——一种模型无关的分词器移植方法;其二为针对多词超令牌的新型预分词学习机制,以提升压缩率并减少碎片化。TokenAdapt通过混合启发式策略初始化新唯一令牌嵌入,该策略结合两种方法:基于旧分词器子词分解的局部估计,以及利用原始词汇表中top-k语义相似令牌的全局估计。此方法旨在保持语义的同时显著减少再训练需求。实证研究验证了双重贡献:移植启发式成功初始化了唯一令牌,其表现显著优于传统基线方法(包括Transtokenizer和ReTok等复杂方法);而超令牌方案则实现了显著的压缩增益。零样本困惑度结果表明:在不同基础模型和新训练目标分词器中,TokenAdapt混合初始化策略产生的困惑度比率始终低于ReTok和TransTokenizer基线。相较于ReTok,TokenAdapt通常能将总体困惑度比率显著降低至少2倍。
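The hybrid initialization described above can be sketched as a weighted blend of a local and a global estimate. This is a reading of the abstract, not the authors' implementation: the abstract does not specify how similarity is measured or how the two estimates are weighted, so the auxiliary semantic vectors in `aux`, the cosine similarity, the similarity-weighted average, and the `local_weight` parameter are all assumptions. The toy vocabulary is made up.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def local_estimate(pieces, old_emb):
    """Average the old embeddings of the pieces the old tokenizer produces."""
    vectors = [old_emb[p] for p in pieces]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def global_estimate(token, aux, old_emb, k):
    """Similarity-weighted average of the k old tokens closest to `token` in `aux`."""
    scored = sorted(((cosine(aux[token], aux[t]), t) for t in old_emb), reverse=True)[:k]
    weights = [max(sim, 0.0) for sim, _ in scored]   # ignore negative similarity
    total = sum(weights)
    if total == 0.0:                                  # fall back to a plain mean
        weights, total = [1.0] * len(scored), float(len(scored))
    dim = len(next(iter(old_emb.values())))
    return [
        sum(w / total * old_emb[t][i] for w, (_, t) in zip(weights, scored))
        for i in range(dim)
    ]


def hybrid_init(token, pieces, aux, old_emb, k=2, local_weight=0.5):
    """Blend the two estimates; `local_weight` is a hypothetical mixing parameter."""
    loc = local_estimate(pieces, old_emb)
    glo = global_estimate(token, aux, old_emb, k)
    return [local_weight * a + (1.0 - local_weight) * b for a, b in zip(loc, glo)]


# Made-up toy vocabulary: old-model embeddings and auxiliary semantic vectors.
old_emb = {"un": [1.0, 0.0], "happy": [0.0, 1.0], "glad": [0.0, 0.8], "sad": [0.2, -1.0]}
aux = {"unhappy": [0.1, -1.0], "un": [1.0, 0.0], "happy": [0.0, 1.0],
       "glad": [0.1, 0.9], "sad": [0.0, -1.0]}
new_vec = hybrid_init("unhappy", ["un", "happy"], aux, old_emb)
```

In this toy case the local estimate alone would place "unhappy" midway between "un" and "happy", while the global estimate pulls it toward "sad", which illustrates why combining the two could preserve meaning better than subword averaging by itself.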
Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
Abstract
arXiv:2505.09805v1 Announce Type: cross Abstract: Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.
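Summary
Clustering patients into subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. Using a pediatric sepsis dataset from a low-income country (LIC) with 2,686 records (28 numerical and 119 categorical variables), this study compares LLM-based clustering with classical methods. Patient records were serialized into text with and without a clustering objective; embeddings were generated with quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5, and then clustered with K-means. The classical baselines were K-Medoids clustering on UMAP-reduced and FAMD-reduced mixed data. Silhouette scores and statistical tests were used to assess cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86); LLAMA 3.1 8B with the clustering objective performed better at higher cluster counts, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed the classical techniques by capturing richer context and prioritizing key features. These findings highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.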
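The silhouette score used above to judge cluster quality can be made concrete with a short sketch. This is a plain-Python version of the standard definition (mean intra-cluster distance `a` versus mean distance to the nearest other cluster `b`), not the study's pipeline; the 2-D points stand in for LLM embeddings and are invented for the example.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len - 1)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for lab, other in clusters.items() if lab != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy "embeddings": two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # close to 1: compact, separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: points sit nearer the other cluster
```

Values near 1 indicate compact, well-separated clusters, which is the sense in which a score of 0.86 is read as high.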
Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting
Abstract
arXiv:2505.09852v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.
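Summary
Large language models (LLMs) perform impressively across natural language tasks, but their ability to forecast violent conflict remains underexplored. This study examines whether LLMs hold meaningful parametric knowledge, that is, knowledge encoded in their pretrained weights, that lets them predict conflict escalation and fatalities without external data. This matters for early warning systems, humanitarian planning, and policy-making. We compare parametric and non-parametric capabilities: the former relies on pretrained weights alone, while the latter uses Retrieval-Augmented Generation (RAG) to access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports. Incorporating external information can supply up-to-date context missing from the pretrained weights and thereby improve model performance. Our two-part evaluation framework covers conflict-prone regions in the Horn of Africa and the Middle East over 2020-2024. In the parametric setting, LLMs predict conflict trends and fatalities from pretrained knowledge only; in the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. By comparing predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical records, the study reveals the strengths and limitations of LLMs for conflict forecasting and the value of augmenting them with structured external knowledge.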
Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph
Abstract
arXiv:2505.09945v1 Announce Type: cross Abstract: The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.
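Summary
The advent of large language models (LLMs) has enabled numerous applications, including generating responses to queries in chatbots and other conversational assistants. Because they are trained on vast amounts of data, LLMs often overfit heavily, producing superfluous and incorrect content and thus hallucinations in their output. One root cause is the lack of timely, factual, and personalized information supplied to the LLM. This paper proposes a solution that introduces retrieval augmented generation (RAG) based on knowledge graphs (KGs) to help the LLM generate personalized responses tailored to the user. KGs have the advantage of storing continuously updated factual information in a structured way. Although our KGs can hold many kinds of frequently updated personal data, such as calendar, contact, and location data, this paper focuses on calendar data. Experimental results show that, compared with baseline LLMs that receive personal data as text input, our approach understands personal information and generates accurate responses significantly better, with a moderate reduction in response time.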
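A minimal sketch of the idea of retrieving calendar facts from a knowledge graph before prompting an LLM might look as follows. Everything here is hypothetical: the triple layout, the `events_on` and `build_prompt` helpers, and the sample events are invented for illustration, and no actual LLM call is made.

```python
from datetime import date

# Hypothetical calendar knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("event:1", "title", "Dentist appointment"),
    ("event:1", "date", "2025-05-20"),
    ("event:1", "start", "09:30"),
    ("event:2", "title", "Team sync"),
    ("event:2", "date", "2025-05-21"),
    ("event:2", "start", "14:00"),
]

def events_on(triples, day):
    """Return every triple belonging to events dated on the given day."""
    wanted = {s for s, p, o in triples if p == "date" and o == day.isoformat()}
    return [(s, p, o) for s, p, o in triples if s in wanted]

def build_prompt(question, facts):
    """Put only the retrieved facts in front of the question."""
    fact_lines = "\n".join(f"{s} {p} {o}" for s, p, o in facts)
    return f"Facts:\n{fact_lines}\n\nQuestion: {question}"

facts = events_on(TRIPLES, date(2025, 5, 20))
print(build_prompt("What do I have on 20 May?", facts))
```

Retrieving a few structured facts, rather than pasting the whole calendar as text, keeps the prompt short and is one plausible reason a KG-backed approach could affect response time.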
Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback
Abstract
arXiv:2505.09925v1 Announce Type: cross Abstract: This paper introduces an interactive continual learning paradigm where AI models dynamically learn new skills from real-time human feedback while retaining prior knowledge. This paradigm distinctively addresses two major limitations of traditional continual learning: (1) dynamic model updates using streaming, real-time human-annotated data, rather than static datasets with fixed labels, and (2) the assumption of clean labels, by explicitly handling the noisy feedback common in real-world interactions. To tackle these problems, we propose RiCL, a Reinforced interactive Continual Learning framework leveraging Large Language Models (LLMs) to learn new skills effectively from dynamic feedback. RiCL incorporates three key components: a temporal consistency-aware purifier to automatically discern clean from noisy samples in data streams; an interaction-aware direct preference optimization strategy to align model behavior with human intent by reconciling AI-generated and human-provided feedback; and a noise-resistant contrastive learning module that captures robust representations by exploiting inherent data relationships, thus avoiding reliance on potentially unreliable labels. Extensive experiments on two benchmark datasets (FewRel and TACRED), contaminated with realistic noise patterns, demonstrate that our RiCL approach substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.
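Summary
This paper introduces an interactive continual learning paradigm in which AI models dynamically learn new skills from real-time human feedback while retaining prior knowledge. The paradigm addresses two major limitations of traditional continual learning: (1) it updates the model dynamically from streaming, real-time human-annotated data rather than from static datasets with fixed labels; and (2) it drops the assumption of clean labels by explicitly handling the noisy feedback common in real-world interactions. To tackle these problems, we propose RiCL, a Reinforced interactive Continual Learning framework that leverages large language models (LLMs) to learn new skills effectively from dynamic feedback. RiCL has three core components: a temporal consistency-aware purifier that automatically separates clean from noisy samples in data streams; an interaction-aware direct preference optimization strategy that aligns model behavior with human intent by reconciling AI-generated and human-provided feedback; and a noise-resistant contrastive learning module that obtains robust representations by exploiting inherent data relationships, avoiding reliance on potentially unreliable labels. Extensive experiments on two benchmark datasets (FewRel and TACRED) contaminated with realistic noise patterns show that RiCL substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.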
CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation
Abstract
arXiv:2505.09936v1 Announce Type: cross Abstract: The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.
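Summary
The rapid development of generative artificial intelligence (GenAI) offers new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or struggled to produce maps that are both accurate and informative. This study proposes CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). The framework simulates three key stages of cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles that collaborate, discuss, and call tools to achieve their goals. In particular, CartoAgent draws on the visual aesthetic capability and world knowledge of MLLMs to generate maps that are both visually appealing and informative. By separating style from geographic data, the framework can concentrate on designing stylesheets without modifying the vector data, which ensures geographic accuracy. We applied CartoAgent to a task centered on map restyling, namely map style transfer and evaluation, and validated its effectiveness through extensive experiments and a human evaluation study. The framework can be extended to a variety of cartographic design decisions and can inform future integrations of GenAI in cartography.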
Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks
Abstract
arXiv:2505.09901v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.
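Summary
Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is whether LLMs exhibit decision-making behavior similar to that of humans and can achieve comparable (or superior) performance. This study focuses on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We use canonical multi-armed bandit (MAB) tasks from the cognitive science and psychiatry literature to compare the E&E strategies of LLMs, humans, and MAB algorithms. Interpretable choice models capture the agents' E&E strategies, and we investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs show levels of random and directed exploration close to those of humans; in more complex, non-stationary environments, however, LLMs struggle to match human adaptability, particularly in effective directed exploration, even though they achieve similar regret in some scenarios. Our findings highlight both the promise and the limits of LLMs as simulators of human behavior and as tools for automated decision-making, and they point to potential areas for improvement.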
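The distinction between random and directed exploration can be sketched with a simple choice rule. This is an illustrative model, not necessarily the choice model fitted in the paper: a softmax over estimated value plus an uncertainty bonus, where the `temperature` parameter controls random exploration and the `bonus` parameter controls directed exploration. The UCB-style form of the bonus is an assumption.

```python
import math

def choice_probabilities(means, counts, temperature=1.0, bonus=1.0):
    """Probability of pulling each arm under a softmax-with-bonus rule.

    means: current estimated reward of each arm
    counts: how often each arm has been pulled so far
    temperature: higher values flatten the choice (random exploration)
    bonus: weight on the uncertainty term (directed exploration)
    """
    total = sum(counts)
    utilities = [
        m + bonus * math.sqrt(math.log(total + 1) / (n + 1))  # rarely pulled arms get a larger bonus
        for m, n in zip(means, counts)
    ]
    scaled = [u / temperature for u in utilities]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

# Equal estimated value, but arm 1 has been tried far less often.
print(choice_probabilities([0.5, 0.5], [10, 1], bonus=1.0))  # favours the uncertain arm
print(choice_probabilities([0.5, 0.5], [10, 1], bonus=0.0))  # no directed exploration: 50/50
```

Fitting `temperature` and `bonus` to an agent's observed choices is one way such a model could separate the two kinds of exploration for humans and LLMs alike.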
Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data
Abstract
arXiv:2505.09974v1 Announce Type: cross Abstract: The integration of large language models (LLMs) into cyber security applications presents significant opportunities, such as enhancing threat analysis and malware detection, but can also introduce critical risks and safety concerns, including personal data leakage and automated generation of new malware. We present a systematic evaluation of safety risks in fine-tuned LLMs for cyber security applications. Using the OWASP Top 10 for LLM Applications framework, we assessed seven open-source LLMs: Phi 3 Mini 3.8B, Mistral 7B, Qwen 2.5 7B, Llama 3 8B, Llama 3.1 8B, Gemma 2 9B, and Llama 2 70B. Our evaluation shows that fine-tuning reduces safety resilience across all tested LLMs (e.g., the safety score of Llama 3.1 8B against prompt injection drops from 0.95 to 0.15). We propose and evaluate a safety alignment approach that carefully rewords instruction-response pairs to include explicit safety precautions and ethical considerations. This approach demonstrates that it is possible to maintain or even improve model safety while preserving technical utility, offering a practical path forward for developing safer fine-tuning methodologies. This work offers a systematic evaluation for safety risks in LLMs, enabling safer adoption of generative AI in sensitive domains, and contributing towards the development of secure, trustworthy, and ethically aligned LLMs.
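Summary
Integrating large language models (LLMs) into cyber security applications offers significant opportunities, such as better threat analysis and malware detection, but can also introduce critical risks and safety concerns, including personal data leakage and automated generation of new malware. This study systematically evaluates the safety risks of LLMs fine-tuned for cyber security. Using the OWASP Top 10 for LLM Applications framework, we tested seven open-source LLMs: Phi 3 Mini 3.8B, Mistral 7B, Qwen 2.5 7B, Llama 3 8B, Llama 3.1 8B, Gemma 2 9B, and Llama 2 70B. The evaluation shows that fine-tuning reduces safety resilience across all tested models (for example, the safety score of Llama 3.1 8B against prompt injection drops from 0.95 to 0.15). We propose and evaluate a safety alignment approach that carefully rewords instruction-response pairs to include explicit safety precautions and ethical considerations. The approach shows that model safety can be maintained or even improved while technical utility is preserved, offering a practical path toward safer fine-tuning methodologies. This work provides a systematic evaluation of safety risks in LLMs, supports safer adoption of generative AI in sensitive domains, and contributes to the development of secure, trustworthy, and ethically aligned LLMs.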
Dark LLMs: The Growing Threat of Unaligned AI Models
Abstract
arXiv:2505.10066v1 Announce Type: cross Abstract: Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs: models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.
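Summary
Large language models (LLMs) are rapidly reshaping modern life, advancing fields from healthcare to education and beyond. Alongside their remarkable capabilities, however, lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as the training data includes unfiltered, problematic, or "dark" content, the models can learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by "dark LLMs": models deliberately designed without ethical guardrails or modified through jailbreak techniques. We uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs on request. The main idea of the attack was published online more than seven months ago, yet many of the tested LLMs were still vulnerable to it. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practice on AI safety. As model training becomes cheaper and more accessible, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue to democratize access to dangerous knowledge, posing greater risks than anticipated.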
The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine
Abstract
arXiv:2505.10261v1 Announce Type: cross Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.
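Summary
Natural language processing (NLP) has long been applied to medicine, and generative large language models (LLMs) have recently become prominent. However, the differences between the two across medical tasks remain underexplored. Analyzing 19,123 studies, we find that generative LLMs show advantages in open-ended tasks, while traditional NLP dominates information extraction and analysis tasks. As these technologies advance, using them ethically is essential to realizing their potential in medical applications.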
Private Transformer Inference in MLaaS: A Survey
Abstract
arXiv:2505.10315v1 Announce Type: cross Abstract: Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-party computation and homomorphic encryption, enabling inference while preserving both user data and model privacy. This paper reviews recent PTI advancements, highlighting state-of-the-art solutions and challenges. We also introduce a structured taxonomy and evaluation framework for PTI, focusing on balancing resource efficiency with privacy and bridging the gap between high-performance inference and data privacy.
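Summary
Transformer models have revolutionized AI, powering applications such as content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily because sensitive user data is processed centrally. Private Transformer Inference (PTI) offers a solution: it uses cryptographic techniques such as secure multi-party computation and homomorphic encryption to enable inference while preserving the privacy of both user data and the model. This paper reviews recent PTI advances, highlighting state-of-the-art solutions and open challenges. We also introduce a structured taxonomy and evaluation framework for PTI that focuses on balancing resource efficiency with privacy and on bridging the gap between high-performance inference and data privacy.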
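To give a feel for the secure multi-party computation mentioned above, here is a toy sketch of additive secret sharing, one common building block of such protocols. It is not taken from the survey and is far from a full PTI system: the `MODULUS` choice and the `share`, `add_shared`, and `reconstruct` names are assumptions for illustration, and only addition of shared values is shown.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus; real protocols fix this per scheme

def share(value, n_parties=2):
    """Split value into n random-looking shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; no party sees the inputs."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

def reconstruct(shares):
    """Combine all shares to recover the secret."""
    return sum(shares) % MODULUS

user_input = share(12)    # e.g. a private activation value
model_weight = share(30)  # e.g. a private model parameter
result = add_shared(user_input, model_weight)
print(reconstruct(result))
```

Linear operations like this one are cheap on shares. The non-linear parts of a Transformer (softmax, GELU, layer normalization) are typically where private inference becomes expensive, which is the efficiency-versus-privacy tension the survey addresses.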
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Abstract
arXiv:2505.10185v1 Announce Type: cross Abstract: Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.
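Summary
Long chain-of-thought (CoT) is an essential ingredient in the effective use of modern large language models, but our understanding of the reasoning strategies behind these capabilities remains limited. Although some prior work has tried to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. This work introduces the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that the framework produces more interpretable and comprehensive analyses than existing methods. We further demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we offer practical insights, for example that training data format (e.g., free-form vs. multiple-choice) affects reasoning behavior far more than data domain, which underscores the importance of format-aware model design.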
Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data
Abstract
arXiv:2505.10260v1 Announce Type: cross Abstract: In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.
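Summary
In an era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have shown remarkable potential for diverse applications, including tasks that require nuanced textual understanding and contextual reasoning. This study investigates the capabilities of several state-of-the-art LLMs (GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2) for zero-shot and few-shot annotation of a complex textual dataset of social media posts in Russian and Ukrainian, focusing on the binary classification task of identifying references to human rights violations. To evaluate the models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis assesses annotation performance under different prompting conditions, with prompts in both English and Russian, and explores the distinctive patterns of errors and disagreements of each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for real-world deployment.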
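Comparing model annotations against gold labels on a binary task can be sketched as below. The abstract does not say which metrics the study reports, so precision, recall, F1, and Cohen's kappa are assumptions chosen because they are common for this kind of comparison; the labels are invented.

```python
def binary_metrics(gold, pred):
    """Precision, recall, F1 for the positive class, plus Cohen's kappa."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    observed = (tp + tn) / n
    # Chance agreement: both say "positive" plus both say "negative".
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}

# 1 = post references a human rights violation, 0 = it does not (toy labels).
gold = [1, 1, 0, 0, 1, 0]
llm = [1, 0, 0, 0, 1, 1]
print(binary_metrics(gold, llm))
```

Kappa corrects raw agreement for chance, which matters when one class is rare, as positive cases may well be in social media data.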